## [1] 4898 13
## [1] "X" "fixed.acidity" "volatile.acidity" "citric.acid"
## [5] "residual.sugar" "chlorides" "free.sulfur.dioxide" "total.sulfur.dioxide"
## [9] "density" "pH" "sulphates" "alcohol"
## [13] "quality"
## 'data.frame': 4898 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
## $ volatile.acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
## $ citric.acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
## $ residual.sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
## $ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
## $ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
## $ total.sulfur.dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
## $ density : num 1.001 0.994 0.995 0.996 0.996 ...
## $ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
## $ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
## $ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
## $ quality : int 6 6 6 6 6 6 6 6 6 6 ...
## X fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 1 Min. : 3.800 Min. :0.0800 Min. :0.0000 Min. : 0.600
## 1st Qu.:1225 1st Qu.: 6.300 1st Qu.:0.2100 1st Qu.:0.2700 1st Qu.: 1.700
## Median :2450 Median : 6.800 Median :0.2600 Median :0.3200 Median : 5.200
## Mean :2450 Mean : 6.855 Mean :0.2782 Mean :0.3342 Mean : 6.391
## 3rd Qu.:3674 3rd Qu.: 7.300 3rd Qu.:0.3200 3rd Qu.:0.3900 3rd Qu.: 9.900
## Max. :4898 Max. :14.200 Max. :1.1000 Max. :1.6600 Max. :65.800
## chlorides free.sulfur.dioxide total.sulfur.dioxide density pH
## Min. :0.00900 Min. : 2.00 Min. : 9.0 Min. :0.9871 Min. :2.720
## 1st Qu.:0.03600 1st Qu.: 23.00 1st Qu.:108.0 1st Qu.:0.9917 1st Qu.:3.090
## Median :0.04300 Median : 34.00 Median :134.0 Median :0.9937 Median :3.180
## Mean :0.04577 Mean : 35.31 Mean :138.4 Mean :0.9940 Mean :3.188
## 3rd Qu.:0.05000 3rd Qu.: 46.00 3rd Qu.:167.0 3rd Qu.:0.9961 3rd Qu.:3.280
## Max. :0.34600 Max. :289.00 Max. :440.0 Max. :1.0390 Max. :3.820
## sulphates alcohol quality
## Min. :0.2200 Min. : 8.00 Min. :3.000
## 1st Qu.:0.4100 1st Qu.: 9.50 1st Qu.:5.000
## Median :0.4700 Median :10.40 Median :6.000
## Mean :0.4898 Mean :10.51 Mean :5.878
## 3rd Qu.:0.5500 3rd Qu.:11.40 3rd Qu.:6.000
## Max. :1.0800 Max. :14.20 Max. :9.000
## vars n mean sd median trimmed mad min max range skew
## X 1 4898 2449.50 1414.08 2449.50 2449.50 1815.44 1.00 4898.00 4897.00 0.00
## fixed.acidity 2 4898 6.85 0.84 6.80 6.82 0.74 3.80 14.20 10.40 0.65
## volatile.acidity 3 4898 0.28 0.10 0.26 0.27 0.09 0.08 1.10 1.02 1.58
## citric.acid 4 4898 0.33 0.12 0.32 0.33 0.09 0.00 1.66 1.66 1.28
## residual.sugar 5 4898 6.39 5.07 5.20 5.80 5.34 0.60 65.80 65.20 1.08
## chlorides 6 4898 0.05 0.02 0.04 0.04 0.01 0.01 0.35 0.34 5.02
## free.sulfur.dioxide 7 4898 35.31 17.01 34.00 34.36 16.31 2.00 289.00 287.00 1.41
## total.sulfur.dioxide 8 4898 138.36 42.50 134.00 136.96 43.00 9.00 440.00 431.00 0.39
## density 9 4898 0.99 0.00 0.99 0.99 0.00 0.99 1.04 0.05 0.98
## pH 10 4898 3.19 0.15 3.18 3.18 0.15 2.72 3.82 1.10 0.46
## sulphates 11 4898 0.49 0.11 0.47 0.48 0.10 0.22 1.08 0.86 0.98
## alcohol 12 4898 10.51 1.23 10.40 10.43 1.48 8.00 14.20 6.20 0.49
## quality 13 4898 5.88 0.89 6.00 5.85 1.48 3.00 9.00 6.00 0.16
## kurtosis se
## X -1.20 20.21
## fixed.acidity 2.17 0.01
## volatile.acidity 5.08 0.00
## citric.acid 6.16 0.00
## residual.sugar 3.46 0.07
## chlorides 37.51 0.00
## free.sulfur.dioxide 11.45 0.24
## total.sulfur.dioxide 0.57 0.61
## density 9.78 0.00
## pH 0.53 0.00
## sulphates 1.59 0.00
## alcohol -0.70 0.02
## quality 0.21 0.01
## X fixed.acidity volatile.acidity citric.acid residual.sugar chlorides free.sulfur.dioxide
## 1 1 7.0 0.27 0.36 20.7 0.045 45
## 2 2 6.3 0.30 0.34 1.6 0.049 14
## 3 3 8.1 0.28 0.40 6.9 0.050 30
## 4 4 7.2 0.23 0.32 8.5 0.058 47
## 5 5 7.2 0.23 0.32 8.5 0.058 47
## 6 6 8.1 0.28 0.40 6.9 0.050 30
## total.sulfur.dioxide density pH sulphates alcohol quality
## 1 170 1.0010 3.00 0.45 8.8 6
## 2 132 0.9940 3.30 0.49 9.5 6
## 3 97 0.9951 3.26 0.44 10.1 6
## 4 186 0.9956 3.19 0.40 9.9 6
## 5 186 0.9956 3.19 0.40 9.9 6
## 6 97 0.9951 3.26 0.44 10.1 6
Omitting some features that appeared less interesting for brevity:
##
## 3 4 5 6 7 8 9
## 20 163 1457 2198 880 175 5
##
## L M H
## 183 3655 1060
## vars n mean sd median trimmed mad min max range skew kurtosis se
## 1 1 4898 6.85 0.84 6.8 6.82 0.74 3.8 14.2 10.4 0.65 2.17 0.01
## vars n mean sd median trimmed mad min max range skew kurtosis se
## 1 1 4898 0.28 0.1 0.26 0.27 0.09 0.08 1.1 1.02 1.58 5.08 0
## vars n mean sd median trimmed mad min max range skew kurtosis se
## 1 1 4898 6.39 5.07 5.2 5.8 5.34 0.6 65.8 65.2 1.08 3.46 0.07
## vars n mean sd median trimmed mad min max range skew kurtosis se
## 1 1 4898 0.05 0.02 0.04 0.04 0.01 0.01 0.35 0.34 5.02 37.51 0
## vars n mean sd median trimmed mad min max range skew kurtosis se
## 1 1 4898 138.36 42.5 134 136.96 43 9 440 431 0.39 0.57 0.61
## 10% 20% 30% 40% 50% 60% 70% 80% 90% 99% 100%
## 87.00 102.00 113.00 124.00 134.00 147.00 160.00 176.00 195.00 241.03 440.00
## vars n mean sd median trimmed mad min max range skew kurtosis se
## 1 1 4898 0.99 0 0.99 0.99 0 0.99 1.04 0.05 0.98 9.78 0
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9871 0.9917 0.9937 0.9940 0.9961 1.0390
## vars n mean sd median trimmed mad min max range skew kurtosis se
## 1 1 4898 3.19 0.15 3.18 3.18 0.15 2.72 3.82 1.1 0.46 0.53 0
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.00 9.50 10.40 10.51 11.40 14.20
## vars n mean sd median trimmed mad min max range skew kurtosis se
## 1 1 4898 10.51 1.23 10.4 10.43 1.48 8 14.2 6.2 0.49 -0.7 0.02
Use a variety of plot types to get an indication of whether a given feature varies significantly with quality.
Omitting several features that appeared less interesting for brevity:
See above
residual.sugar - Interesting because of its long, thick tail compared to the other features, with some suggestion that it is in fact bimodal. There is also some suggestion that higher quality wines have lower levels than average wines.
alcohol - By far the clearest indicator of quality, with higher alcohol content indicating higher quality. Its distribution was also platykurtic (kurtosis -0.70), whereas the other input variables were mostly leptokurtic.
(I suspect there is a relationship between high alcohol and low residual.sugar.)
chlorides - Higher quality wines appear to have lower chloride levels.
density - Higher quality wines appear to have lower density.
Not exactly, but I did create a variation of the output variable: quality.category.
Rather than consider each individual quality score (3 to 9 in this sample), I grouped them into 3 categories (Low, Medium, High) to simplify the analysis.
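A minimal sketch of how such a three-level category could be derived with `cut()`. The break points (3-4 = L, 5-6 = M, 7-9 = H) are inferred from the two count tables above rather than taken from the original code, and the toy `quality` vector simply replays the reported counts.

```r
# Toy vector replaying the quality counts reported above (scores 3 to 9).
quality <- rep(3:9, times = c(20, 163, 1457, 2198, 880, 175, 5))

# Assumed break points: (0,4] -> L, (4,6] -> M, (6,10] -> H.
quality.category <- cut(quality,
                        breaks = c(0, 4, 6, 10),
                        labels = c("L", "M", "H"),
                        ordered_result = TRUE)

category.counts <- table(quality.category)
print(category.counts)
```

With these break points the counts come out as 183 / 3655 / 1060, matching the L / M / H table above.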
residual.sugar had a long, thick tail, so I applied a log10 transformation, which gave a clearer idea of where the bulk of the data lay.
As well as a large peak around 2, there was a shorter, fatter peak (roughly equal in overall size) around 10.
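To illustrate the transformation, here is a rough sketch in base R. The data are simulated (two lognormal groups centred near 2 and 10, chosen only to mimic the shape described) and `hist(..., plot = FALSE)` is used so that the bin counts can be inspected without drawing anything.

```r
# Hypothetical bimodal sample standing in for residual.sugar (not the real data).
set.seed(1)
residual.sugar <- c(rlnorm(600, meanlog = log(2), sdlog = 0.3),
                    rlnorm(400, meanlog = log(10), sdlog = 0.3))

# Histogram on the log10 scale; plot = FALSE returns the bin counts only.
log.sugar <- log10(residual.sugar)
h <- hist(log.sugar, breaks = 30, plot = FALSE)

# Bin midpoints converted back to the original units for reading off peaks.
peaks <- data.frame(mid = round(10^h$mids, 2), count = h$counts)
print(peaks[order(-peaks$count)[1:5], ])
```

On the log scale equal bin widths correspond to equal ratios, which is why the long right tail no longer swamps the bulk of the data.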
Produce a bivariate matrix showing the correlation and distribution for each pair of features to steer further analysis. Produce one matrix for all the data and another for just the higher quality wines to see whether, and how, they differ.
This takes a long time to produce, so I pre-prepared the images (and commented out the code):
I will use this output to choose which bivariate and multivariate plots to produce.
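A cheap numerical companion to the plot matrices is to compare plain correlation matrices for all wines and for the higher quality subset. The sketch below uses simulated data with the same column names; the `quality >= 7` cut-off for "higher quality" is an assumption based on the category counts, and base `cor()` is swapped in for whatever plotting function produced the original images.

```r
# Hypothetical toy data; column names mirror the data set, values are simulated.
set.seed(42)
n <- 300
alcohol <- runif(n, 8, 14)
wines <- data.frame(
  alcohol        = alcohol,
  residual.sugar = pmax(0.6, 20 - 1.5 * alcohol + rnorm(n, sd = 3)),
  chlorides      = pmax(0.01, 0.09 - 0.004 * alcohol + rnorm(n, sd = 0.01)),
  quality        = pmin(9, pmax(3, round(0.4 * alcohol + rnorm(n, 1.5, 0.8))))
)

features <- c("alcohol", "residual.sugar", "chlorides")

# Correlation matrix for all wines, then for the higher quality subset only.
cor.all  <- cor(wines[, features])
cor.high <- cor(wines[wines$quality >= 7, features])

# Positive entries: the pair is more positively correlated among high quality wines.
print(round(cor.high - cor.all, 2))
```

Large entries in the difference matrix point at the pairs worth plotting in more detail.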
Scatter plots to compare individual features against quality.
Omitting several features for brevity:
##
## Pearson's product-moment correlation
##
## data: wines$quality and wines$volatile.acidity
## t = -13.891, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.2215214 -0.1676307
## sample estimates:
## cor
## -0.194723
##
## Spearman's rank correlation rho
##
## data: wines$quality and wines$volatile.acidity
## S = 2.3434e+10, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## -0.1965617
##
## Pearson's product-moment correlation
##
## data: wines$quality and wines$residual.sugar
## t = -6.8603, df = 4896, p-value = 7.724e-12
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.12524103 -0.06976101
## sample estimates:
## cor
## -0.09757683
##
## Spearman's rank correlation rho
##
## data: wines$quality and wines$residual.sugar
## S = 2.1191e+10, p-value = 8.822e-09
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## -0.08206979
##
## Pearson's product-moment correlation
##
## data: wines$quality and wines$chlorides
## t = -15.024, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.2365501 -0.1830039
## sample estimates:
## cor
## -0.2099344
##
## Spearman's rank correlation rho
##
## data: wines$quality and wines$chlorides
## S = 2.5743e+10, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## -0.3144885
##
## Pearson's product-moment correlation
##
## data: wines$quality and wines$total.sulfur.dioxide
## t = -12.418, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.2017563 -0.1474524
## sample estimates:
## cor
## -0.1747372
##
## Spearman's rank correlation rho
##
## data: wines$quality and wines$total.sulfur.dioxide
## S = 2.3436e+10, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## -0.1966803
##
## Pearson's product-moment correlation
##
## data: wines$quality and wines$density
## t = -22.581, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.3322718 -0.2815385
## sample estimates:
## cor
## -0.3071233
##
## Spearman's rank correlation rho
##
## data: wines$quality and wines$density
## S = 2.6406e+10, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## -0.348351
##
## Pearson's product-moment correlation
##
## data: wines$quality and wines$alcohol
## t = 33.858, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.4126015 0.4579941
## sample estimates:
## cor
## 0.4355747
##
## Spearman's rank correlation rho
##
## data: wines$quality and wines$alcohol
## S = 1.096e+10, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.4403692
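For reference, output of the kind shown above comes from `cor.test()` in base R. A minimal sketch on simulated data follows; the vectors are hypothetical stand-ins for `wines$quality` and `wines$alcohol`, and `exact = FALSE` is my addition to avoid the ties warning that integer quality scores would otherwise trigger for Spearman.

```r
# Hypothetical toy data (simulated, not the wine data set).
set.seed(7)
alcohol <- runif(500, 8, 14)
quality <- pmin(9, pmax(3, round(0.4 * alcohol + rnorm(500, 1.5, 0.8))))

# Pearson measures linear association; Spearman works on ranks.
pearson  <- cor.test(quality, alcohol, method = "pearson")
# exact = FALSE because integer quality scores produce many ties.
spearman <- cor.test(quality, alcohol, method = "spearman", exact = FALSE)

print(pearson$estimate)
print(spearman$estimate)
print(pearson$conf.int)
```

Comparing the two estimates is a quick check on whether a relationship is monotonic but not linear, as the chlorides results above (Pearson -0.21, Spearman -0.31) hint.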
(more on this in the Multivariate section)
Confirmed that there is a clear positive correlation between alcohol and quality (Pearson 0.436, Spearman 0.440).
Surprised by the very small correlation between residual.sugar and quality (Pearson -0.098). Curiously, average levels of residual.sugar appear to start low for low quality wines, rise for medium quality wines and dip again for high quality wines. What, if anything, does this suggest?
It’s beginning to look like there is a recipe for creating higher quality wines, something like:
(possibly) Lower levels of total.sulfur.dioxide
(I’m ignoring density because I think it is a direct consequence of the levels of the above.)
There is just a hint of an alternative recipe for higher quality wines (difficult to confirm given the relatively small number of samples):
Higher levels of chlorides (~0.06)
Strong positive correlation between total.sulfur.dioxide and density, and a corresponding strong negative correlation between total.sulfur.dioxide and alcohol.
Surprised that, whilst there is a clear negative correlation between fixed.acidity and pH (as you might expect; Pearson -0.426), there is little to no relationship between volatile.acidity and pH.
The strongest relationship between a feature and the output variable (quality) was for alcohol, with a Pearson correlation of 0.436.
Overall, the strongest correlation was between residual.sugar and density (Pearson 0.839), closely followed by alcohol and density (-0.78).
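Rather than reading the strongest pairs off a plot matrix by eye, they can be ranked directly from a correlation matrix. This is a sketch on simulated data (the column names are borrowed from the data set, the values and relationships are invented) and not the code used for the figures quoted above.

```r
# Hypothetical toy data; only the column names come from the data set.
set.seed(3)
n <- 400
residual.sugar <- rexp(n, rate = 1 / 6)
alcohol <- runif(n, 8, 14)
wines <- data.frame(
  residual.sugar = residual.sugar,
  alcohol        = alcohol,
  density        = 0.99 + 0.0004 * residual.sugar - 0.001 * (alcohol - 10) +
                   rnorm(n, sd = 0.0005),
  pH             = rnorm(n, 3.2, 0.15)
)

cors <- cor(wines)
cors[lower.tri(cors, diag = TRUE)] <- NA   # keep each pair once

# Flatten to a table of pairs ordered by absolute correlation.
pairs.tbl <- as.data.frame(as.table(cors), stringsAsFactors = FALSE)
pairs.tbl <- pairs.tbl[!is.na(pairs.tbl$Freq), ]
names(pairs.tbl) <- c("feature.1", "feature.2", "pearson")
pairs.tbl <- pairs.tbl[order(-abs(pairs.tbl$pearson)), ]
print(head(pairs.tbl, 3), row.names = FALSE)
```

Sorting on the absolute value keeps strong negative pairs (such as alcohol and density) next to strong positive ones.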
Look for correlations between input variables, start with those that seem to have a significant impact upon quality and indeed split by quality (to provide a third and thus multivariate plot)
##
## Pearson's product-moment correlation
##
## data: wines$chlorides and wines$alcohol
## t = -27.016, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.3843183 -0.3355673
## sample estimates:
## cor
## -0.3601887
##
## Spearman's rank correlation rho
##
## data: wines$chlorides and wines$alcohol
## S = 3.0763e+10, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## -0.5708064
##
## Pearson's product-moment correlation
##
## data: wines$density and wines$alcohol
## t = -87.255, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.7908646 -0.7689315
## sample estimates:
## cor
## -0.7801376
##
## Spearman's rank correlation rho
##
## data: wines$density and wines$alcohol
## S = 3.568e+10, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## -0.8218551
##
## Pearson's product-moment correlation
##
## data: wines$residual.sugar and wines$alcohol
## t = -35.321, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.4726723 -0.4280267
## sample estimates:
## cor
## -0.4506312
##
## Spearman's rank correlation rho
##
## data: wines$residual.sugar and wines$alcohol
## S = 2.8304e+10, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## -0.4452574
Again we see a cluster (albeit small) of high quality wines, which looks like:
low alcohol + high sugar = high quality
It is hard to see because there are so many medium quality wines, but the density of high quality wines with alcohol 11 - 14 and residual.sugar < 5, combined with the geom_smooth line, suggests that this is the sweet spot. It is also telling that there are comparatively few low and medium quality wines in this range.
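Since the sweet spot is hard to judge by eye, one option is to count it: what share of each quality category falls inside alcohol 11 - 14 and residual.sugar < 5? The sketch below does this on simulated data; the thresholds are the ones quoted above, everything else (the data, the way the categories are generated) is hypothetical.

```r
# Hypothetical toy data (simulated); thresholds are the ones quoted in the text.
set.seed(11)
n <- 1000
wines <- data.frame(
  alcohol        = runif(n, 8, 14),
  residual.sugar = rexp(n, rate = 1 / 6) + 0.6
)
score <- 0.4 * wines$alcohol - 0.03 * wines$residual.sugar + rnorm(n, 1.6, 0.8)
wines$quality.category <- cut(score, breaks = c(-Inf, 4.5, 6.5, Inf),
                              labels = c("L", "M", "H"))

# Flag wines in the suggested sweet spot: alcohol 11-14 and residual.sugar < 5.
in.spot <- wines$alcohol >= 11 & wines$alcohol <= 14 & wines$residual.sugar < 5

# Share of each quality category that falls inside the sweet spot.
share <- tapply(in.spot, wines$quality.category, mean)
print(round(share, 2))
```

Comparing shares rather than raw counts sidesteps the problem that medium quality wines dominate the sample.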
##
## Pearson's product-moment correlation
##
## data: highwines$volatile.acidity and highwines$alcohol
## t = 19.179, df = 1058, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.4618272 0.5512684
## sample estimates:
## cor
## 0.5079155
##
## Pearson's product-moment correlation
##
## data: wines$density and wines$chlorides
## t = 18.624, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.2308679 0.2831779
## sample estimates:
## cor
## 0.2572113
##
## Spearman's rank correlation rho
##
## data: wines$density and wines$chlorides
## S = 9629500000, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.5083018
An article discussing the relationship between alcohol and residual.sugar also suggests you need high acidity with high sugar.
##
## Pearson's product-moment correlation
##
## data: wines$residual.sugar and wines$fixed.acidity
## t = 6.2537, df = 4896, p-value = 4.348e-10
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.06116674 0.11673612
## sample estimates:
## cor
## 0.0890207
##
## Spearman's rank correlation rho
##
## data: wines$residual.sugar and wines$fixed.acidity
## S = 1.7494e+10, p-value = 6.955e-14
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.1067249
##
## Pearson's product-moment correlation
##
## data: highwines$residual.sugar and highwines$fixed.acidity
## t = 8.3967, df = 1058, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.1926383 0.3055648
## sample estimates:
## cor
## 0.2499514
residual.sugar vs fixed.acidity
[residual.sugar vs volatile.acidity]
For completeness I also checked residual.sugar against volatile.acidity; this showed virtually no correlation for any quality level.
##
## Pearson's product-moment correlation
##
## data: wines$pH and wines$fixed.acidity
## t = -32.934, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.4485154 -0.4026542
## sample estimates:
## cor
## -0.4258583
##
## Pearson's product-moment correlation
##
## data: wines$residual.sugar and wines$total.sulfur.dioxide
## t = 30.669, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.3776791 0.4246712
## sample estimates:
## cor
## 0.4014393
During the bivariate section it looked like:
similarly, a big cluster of high quality wines with ‘low chlorides’
similarly, a smaller cluster of high quality wines with ‘high chlorides’
The multivariate plot between chlorides and alcohol reinforces this suspicion; it looks like we see
We appear to be seeing similar behaviour for these two:
It’s less clear that medium (and low) quality wines fall between these clusters.
(Please note: the ‘=’ is misleading because, of course, there are also lower quality wines that follow the same pattern.)
The strong positive correlation between volatile.acidity and alcohol for higher quality wines was surprising, given there was virtually no correlation for lower and medium quality wines.
fixed.acidity vs alcohol = -0.3
NO
## Warning in loop_apply(n, do.ply): position_stack requires constant width: output may be incorrect
Alcohol is by far the single biggest contributor to the quality of white wine (in this sample). As a crude measure, its Pearson correlation with quality is 0.436, over twice the magnitude of chlorides (-0.21), the next strongest feature once density (-0.307) is set aside as a consequence of the others.
The boxplots show quite clearly the marked increase in quality as the level of alcohol increases, with a sweet spot between around 11 and 13.
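The numbers behind a boxplot can be pulled out without drawing it, which makes the trend easier to quote. A small sketch on simulated data (the vectors are hypothetical stand-ins for alcohol and quality):

```r
# Hypothetical toy data (simulated, not the wine data set).
set.seed(5)
alcohol <- runif(600, 8, 14)
quality <- pmin(9, pmax(3, round(0.4 * alcohol + rnorm(600, 1.5, 0.8))))

# plot = FALSE returns the five-number summaries instead of drawing the boxes.
bp <- boxplot(alcohol ~ quality, plot = FALSE)

# Row 3 of bp$stats holds the median of each box.
medians <- setNames(bp$stats[3, ], bp$names)
print(round(medians, 2))
```

If the medians rise steadily with the quality score, the boxplot impression is backed by numbers rather than by eye.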
The histogram and frequency polygon both show that the distribution of low and medium quality wines is ‘quite similar’ whereas (as expected) the distribution of high quality wines is noticeably right shifted (higher alcohol).
It’s also worth noting that there is a smaller peak of higher quality wines with lower alcohol content (around 9%). So, we have:
A similar pattern could be observed for certain other features. For example:
small cluster of high quality wines with high residual.sugar
small cluster of high quality wines with high chlorides
We’ll look at how these contribute together towards the quality of wine in the following plots
Here we have plotted the features (alcohol, residual.sugar and chlorides) pairwise against each other, distinguishing quality by colour and shape, to see whether the features in each pair reinforce each other. In particular, we want to see whether the high quality clusters (large and small) outlined earlier still exist when we combine features.
Two views of each pair are shown: the left one with a facet wrap by quality, so we can see each distribution in isolation (hopefully showing the two high quality clusters), and the right one with the quality levels overlaid (hopefully showing differences between the category levels).
Although it is difficult to be certain, particularly because of the uneven distribution (there are many more medium quality wines), in all pairs it looks like the two high quality clusters hypothesis still holds.
Considering each in turn:
There does appear to be a large clustering of blue (high quality) top right with low chlorides and high alcohol, though it’s not very dense. Similarly there appears to be a smaller cluster bottom right (low alcohol and high chlorides).
Moreover (with a bit of a squint) it looks like the critical mass of medium quality wines sits somewhere between the two clusters
Again there is a large cluster of blue (high quality) top left (high alcohol and low residual.sugar). The density of blue then reduces as alcohol decreases and residual.sugar increases, until we see a smaller blue cluster with low alcohol (~ 9) and high residual.sugar (~ 12 - 15).
Low and medium quality wines tend to have a lower alcohol content and a more evenly spread residual.sugar
This time we see the larger cluster with low residual.sugar and low chlorides, and the smaller cluster with high values for each. That said, it’s not as clear as with the other pairs considered.
There are two recipes for a good white wine:
In fact, plenty of poorer quality wines follow these recipes too, but following them should increase your probability of having a high quality wine.
When plotting volatile.acidity against alcohol split by quality we see a marked difference for high quality wines compared to the rest. We see a fairly strong positive correlation (Pearson 0.5) for high quality wines vs virtually nothing for the rest.
So, volatile.acidity might be another feature worth considering when producing/predicting/evaluating white wine.
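The per-category comparison can be computed in one step by splitting the data on the quality category. This is a sketch on simulated data in which, by construction, volatile.acidity tracks alcohol only in the "H" group; the data frame and its values are hypothetical.

```r
# Hypothetical toy data (simulated, not the wine data set).
set.seed(9)
n <- 900
wines <- data.frame(
  quality.category = factor(rep(c("L", "M", "H"), each = n / 3),
                            levels = c("L", "M", "H")),
  alcohol = runif(n, 8, 14)
)
# Simulated: volatile.acidity tracks alcohol only in the "H" group.
slope <- ifelse(wines$quality.category == "H", 0.03, 0)
wines$volatile.acidity <- 0.28 + slope * (wines$alcohol - 11) + rnorm(n, sd = 0.08)

# Pearson correlation within each quality category.
by.quality <- sapply(split(wines, wines$quality.category),
                     function(d) cor(d$volatile.acidity, d$alcohol))
print(round(by.quality, 2))
```

Seeing the three coefficients side by side makes it clear when a relationship exists in one category only.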
All the analysis suffers from one big flaw: my fundamental ignorance of chemistry (missing subject matter expertise), meaning that observations that seem interesting to me might well be obvious or inevitable, and vice versa.
Also, this is a relatively small data set (4898 wines), so the conclusions are subject to sampling error and may be wrong.
The vast majority of the data had mid-range quality values (5 or 6), which made comparison difficult. In practice I suspect there was more distance between those wines than a difference of just one point would suggest. A finer scale, or individual ratings per reviewer per wine rather than an aggregated score per wine, might have enabled more interesting analysis.
I found it difficult to abstract repeated code into functions, and eventually abandoned the attempt. This was largely because, for most plots, I had to calculate the bounds manually and painstakingly.
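One possible way around the manual bounds, sketched here with hindsight: derive them from quantiles so the same helper works for every feature. Both function names are hypothetical, and the 1% / 99% cut-offs are an assumption that would need tuning per plot.

```r
# Hypothetical helper: derive plot limits from quantiles instead of by hand.
# The 1% / 99% cut-offs are an assumption, chosen to trim long tails.
feature.bounds <- function(x, lower = 0.01, upper = 0.99) {
  unname(quantile(x, probs = c(lower, upper), na.rm = TRUE))
}

# Hypothetical wrapper reusing the bounds for any numeric feature.
trimmed.hist <- function(x, breaks = 30) {
  lim <- feature.bounds(x)
  hist(x[x >= lim[1] & x <= lim[2]], breaks = breaks, plot = FALSE)
}

set.seed(2)
toy <- c(rexp(990, rate = 1 / 5), 60, 65)  # simulated long-tailed feature
print(feature.bounds(toy))
print(range(trimmed.hist(toy)$breaks))
```

The same bounds could be passed to a plotting call as axis limits, so each plot would need only the feature name rather than hand-picked numbers.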
Moving forward, it would (perhaps) be interesting to apply a clustering model (e.g. K-Means) to the high quality wines to determine whether the two categories really do exist.
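As a starting point, a K-Means run might look something like the sketch below. It uses base R `kmeans()` on simulated data: the two groups are built in deliberately (a large high alcohol / low sugar group and a small low alcohol / high sugar group), so it only shows the mechanics, not evidence that the clusters exist in the real data.

```r
set.seed(123)
# Simulated high quality wines in two assumed groups (not the real data).
highwines <- data.frame(
  alcohol        = c(rnorm(80, 12.5, 0.5), rnorm(20, 9, 0.3)),
  residual.sugar = c(rnorm(80, 3, 1),      rnorm(20, 13, 1.5))
)

# Scale first so both features contribute comparably to the distances.
scaled <- scale(highwines)
fit <- kmeans(scaled, centers = 2, nstart = 25)

print(fit$size)
print(aggregate(highwines, by = list(cluster = fit$cluster), FUN = mean))
```

On the real data the number of clusters should not be fixed at 2 in advance; comparing the total within-cluster sum of squares (`fit$tot.withinss`) across several values of `centers` would be a fairer test of the two-recipe idea.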
Additionally, we could train a predictive model (e.g. logistic regression or a neural network) to predict the quality of wine (perhaps explicitly favouring the selected features).
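A first attempt at the logistic regression idea could look like this sketch. It simplifies the problem to a binary outcome (high quality vs the rest), which is my assumption rather than something settled above, and it is fitted on simulated data, so the coefficients mean nothing beyond illustrating the calls.

```r
set.seed(21)
n <- 800
# Hypothetical toy data; the features are the ones singled out in the analysis.
wines <- data.frame(
  alcohol   = runif(n, 8, 14),
  chlorides = runif(n, 0.02, 0.08)
)
# Simulated outcome: probability of "high quality" rises with alcohol.
p <- plogis(-12 + 1.0 * wines$alcohol - 10 * wines$chlorides)
wines$high <- rbinom(n, size = 1, prob = p)

# Logistic regression for high quality vs the rest.
model <- glm(high ~ alcohol + chlorides, data = wines, family = binomial)
print(round(coef(model), 2))

# Predicted probability for a hypothetical new wine.
new.wine <- data.frame(alcohol = 12.5, chlorides = 0.035)
print(predict(model, newdata = new.wine, type = "response"))
```

To predict all three categories rather than a yes/no outcome, an ordinal or multinomial model would be the natural next step, and any such model should be evaluated on held-out wines.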